-
-
-
- Clustal V Multiple Sequence Alignments.
-
- Documentation (Installation and Usage).
-
- Des Higgins
- European Molecular Biology Laboratory
- Postfach 10.2209
- D-6900 Heidelberg
- Germany.
-
- higgins@EMBL-Heidelberg.DE
-
-
- *******************************************************************
-
-
- Contents.
-
-
- 1 Overview
-
- 2 Installation
-
- 3 Interactive usage
-
- 4 Command-line interface
-
- 5 Algorithms and references
-
-
- *******************************************************************
-
- 1. Overview
-
- This document describes how to install and use ClustalV on various
- machines. ClustalV is a complete upgrade and rewrite of the Clustal
- package of multiple alignment programs (Higgins and Sharp, 1988 and
- 1989). The original programs were written in Fortran for
- microcomputers running MSDOS. You carried out a complete alignment
- by running 3 programs in succession. Later, these were merged into
- a single menu driven program with on-line help, for VAX/VMS.
- ClustalV was written in C and has all of the features of the old
- programs plus many new ones. It has been compiled and tested using
- VAX/VMS C, Decstation ULTRIX C, Gnu C for Sun workstations, Turbo C
- for IBM PC's and Think C for Apple Mac's. The original Clustal was
- written by Des Higgins while he was a Post-Doc in the lab of Paul
- Sharp in the Genetics Department, Trinity College, Dublin 2,
- Ireland.
-
- The main feature of the old package was the ability to carry out
- reliable multiple alignments of many sequences. The sensitivity of
- the program is as good as that of any other program we have tried,
- with the exception of the programs of Vingron and Argos (1991), while
- it works in reasonable time on a microcomputer. The programs of
- Vingron and Argos are specialised for finding distant similarities
- between proteins but require mainframes or workstations and are more
- difficult to use.
-
- The main new features are: profile alignments (alignments of old
- alignments); phylogenetic trees (Neighbor Joining trees calculated
- after multiple alignment with a bootstrapping option); better
- sequence input (automatically recognise and read NBRF/PIR, Pearson
- (Fasta) or EMBL/SwissProt formats); flexible alignment output
- (choose one of: old Clustal format, NBRF/PIR, GCG msf format or
- Phylip format); full command line interface (everything that you can
- do interactively can be specified on the command line).
-
- In version 7 of the GCG package, there is a program called PILEUP
- which uses a very similar algorithm to the one in ClustalV. There
- are 2 main differences between the programs: 1) the metric used to
- compare the sequences for the initial "guide tree" uses a full
- global, optimal alignment in PILEUP instead of the fast, approximate
- ones in ClustalV. This makes PILEUP much slower for the comparison
- of long sequences. In principle, the distances calculated from
- PILEUP will be more sensitive than ours, but in practice it will not
- make much difference, except in difficult cases. 2) During the
- multiple alignment, terminal gaps are penalised in ClustalV but not
- in PILEUP. This will make the PILEUP alignments better when the
- sequences are of very different lengths (has no effect if there are
- no large terminal gaps).
-
-
- This software may be distributed and used freely, provided that you
- do not modify it or this documentation in any way without the
- permission of the authors.
-
- If you wish to refer to ClustalV, please cite:
- Higgins,D.G. Bleasby,A.J. and Fuchs,R. (1991) CLUSTAL V: improved software
- for multiple sequence alignment. ms. submitted to CABIOS.
-
- The overall multiple alignment algorithm was described in:
- Higgins,D.G. and Sharp,P.M. (1989). Fast and sensitive multiple
- sequence alignments on a microcomputer. CABIOS, vol. 5, 151-153.
-
-
- ACKNOWLEDGEMENTS.
-
- D.H. would particularly like to thank Paul Sharp, in whose lab. this
- work originated. We also thank Manolo Gouy, Gene Myers, Peter Rice
- and Martin Vingron for suggestions, bug-fixes and help.
-
- Des Higgins and Rainer Fuchs,
- EMBL Data Library, Heidelberg, Germany.
-
- Alan Bleasby,
- Daresbury, UK.
-
- JUNE 1991
- *******************************************************************
-
- 2. Installation.
-
-
-
- As far as possible, we have tried to make ClustalV portable to any
- machine with a standard C compiler (proposed ANSI C standard). The
- source code, as supplied by us, has been compiled and tested using
- the following compilers:
-
- VAX/VMS C
- Ultrix C (on a Decstation 2100)
- Gnu C on a Sun 4 workstation
- Think C on an Apple Macintosh SE
- Turbo C on an IBM AT.
-
- In each case, one must make 1 change to 1 line of code in 1 header
- file. This is described below. The exact capacity of the program
- (how many sequences of what length can be aligned) will depend of
- course on available memory but can also be set in this header file.
-
- The package comes as 9 C source files; 3 header files; 1 file of on-
- line help; this documentation file; 3 make files:
-
- Source code: clustalv.c, amenu.c, gcgcheck.c, myers.c, sequence.c,
- showpair.c, trees.c, upgma.c, util.c
-
- Header files: clustalv.h, general.h, matrices.h
-
- On-Line help: clustalv.hlp (must be renamed or defined as
- clustalv_help except on PC's)
-
- Documentation: clustalv.doc (this file).
-
- Makefiles: makefile.sun (gnu c on Sun), vmslink.com (vax/vms),
- makefile.ult (ultrix).
-
-
-
-
-
-
-
- Before compiling ClustalV you must look at and possibly change
- clustalv.h, shown below.
-
- /*******************CLUSTALV.H********************************/
-
- /*
- Main header file for ClustalV. Uncomment ONE of the following lines
- depending on which compiler you wish to use.
- */
-
- #define VMS 1 /* VAX VMS */
-
- /*#define MAC 1 Think_C for MacIntosh */
-
- /*#define MSDOS 1 Turbo C for PC's */
-
- /*#define UNIX 1 Ultrix for Decstations or Gnu C for Sun */
-
- /*************************************************************/
-
- #include "general.h"
-
- #define MAXNAMES 10
- #define MAXTITLES 60
- #define FILENAMELEN 256
-
- #define UNKNOWN 0
- #define EMBLSWISS 1
- #define PIR 2
- #define PEARSON 3
-
- #define PAGE_LEN 22
-
- #if VMS
- #define DIRDELIM ']'
- #define MAXLEN 3000
- #define MAXN 150
- #define FSIZE 15000
- #define LINELENGTH 60
- #define GCG_LINELENGTH 50
-
- #elif MAC
- #define DIRDELIM ':'
- #define MAXLEN 2600
- #define MAXN 30
- #define FSIZE 10000
- #define LINELENGTH 50
- #define GCG_LINELENGTH 50
-
- #elif MSDOS
- #define DIRDELIM '\\'
- #define MAXLEN 1300
- #define MAXN 30
- #define FSIZE 5000
- #define LINELENGTH 50
- #define GCG_LINELENGTH 50
-
- #elif UNIX
- #define DIRDELIM '/'
- #define MAXLEN 3000
- #define MAXN 50
- #define FSIZE 15000
- #define LINELENGTH 60
- #define GCG_LINELENGTH 50
- #endif
- /*****************end*of*CLUSTALV.H***************************/
-
-
-
- First, make sure that exactly one of the 4 'define' compiler
- directives near the top of the file (e.g. #define VMS 1) is active.
- Choose the one for the system you wish to work on, remove the
- comment delimiters around it (if it is commented out) and put
- comment delimiters around any of the others that are still active.
- If you wish to use a different system, you
- will need to insert a new line with a new keyword (which you must
- invent) to identify your system. Most of the rest of this header
- file is taken up with a block of 'define' statements for each system
- type; e.g. the VAX/VMS block is:
-
- #if VMS
- #define DIRDELIM ']'
- #define MAXLEN 3000
- #define MAXN 150
- #define FSIZE 15000
- #define LINELENGTH 60
- #define GCG_LINELENGTH 50
-
- In this block, you can specify the maximum number of sequences to be
- allowed (MAXN); the maximum sequence length, including gaps
- (MAXLEN); FSIZE declares the size of some workspace, used by the
- fast 2 sequence comparison routines and should be APPROXIMATELY 4
- times MAXLEN; LINELENGTH is the length of the blocks of alignment
- output in the output files; GCG_LINELENGTH is the same but for the
- GCG compatible output only. Finally, DIRDELIM is the character used
- to specify directories and subdirectories in file names. It should
- be the character used to separate the file name itself from the
- directory name (e.g. in VMS, file names are like:
- $drive:[dir1.dir2.dir3]filename.ext;2 so ']' is used as DIRDELIM).
-
- So, if you want to use a system not covered in clustalv.h, you will
- have to insert a new block, like the above one. To compile and link
- the program, we supply 3 makefiles: one each for VAX/VMS, Ultrix
- and GNU C for Sun workstations.
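-
- As an illustration only, a block for a system that is not covered
- might look like the sketch below. The keyword MYSYS and all of the
- values are invented for this example; they are not part of the
- distributed clustalv.h and should be adapted to your machine.
-
```c
/* Hypothetical keyword for a system not covered by clustalv.h.
   In the real header this line replaces the active "#define VMS 1". */
#define MYSYS 1

/* In the real header this would be one more "#elif MYSYS" block placed
   before the final #endif; it is written as "#if" here only so that the
   fragment stands on its own. */
#if MYSYS
#define DIRDELIM '/'          /* separates directory from file name   */
#define MAXLEN 2000           /* max. sequence length, including gaps */
#define MAXN 40               /* max. number of sequences             */
#define FSIZE 8000            /* workspace, approximately 4 x MAXLEN  */
#define LINELENGTH 60         /* width of alignment output blocks     */
#define GCG_LINELENGTH 50     /* same, for GCG compatible output      */
#endif
```
-
- Memory use is governed mainly by MAXLEN, so that is the first value
- to reduce on a small machine.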
-
-
-
- VAX/VMS
-
- Compile and link the program with the
- supplied makefile for vms: vmslink.com .
-
- $ @vmslink
-
- This will produce clustalv.exe (and a lot of .obj files which you can delete).
-
- The on-line help file (clustalv.hlp) should be 'defined' as
- clustalv_help as follows:
-
- $ def clustalv_help $drive:[dir1.dir2]clustalv.hlp
-
- where $drive is the drive designation and [dir1.dir2] is the
- directory where clustalv.hlp is kept.
-
- To make use of the command-line interface, you must make clustalv a
- 'foreign' command with:
-
- $ clustalv :== $$drive:[dir1.dir2]clustalv
-
- where $drive is the drive designation and [dir1.dir2] is the
- directory where clustalv.exe is kept.
-
-
-
- IBM PC/MSDOS/TURBO C
-
- Create a makefile (something.prj) with the names of the source files
- (clustalv.c, amenu.c etc.) and 'make' this using the HUGE memory
- model. You will get half a dozen warnings from the compiler about
- pieces of code that look suspicious to it, but ignore these. The
- help file should remain as clustalv.hlp . To run the program using
- the default settings in clustalv.h, you need approximately 500k of
- memory. To reduce this, the main influence on memory usage is the
- parameter MAXLEN; reduce MAXLEN to reduce memory usage.
-
-
-
- Apple Mac/THINK_C version 4.0.2
-
- This version of the program is not at all Mac like. It runs in a
- window, the inside of which looks just like a normal character based
- terminal. In the future we might put a proper Mac interface on it
- but do not have the time right now. With the default settings in
- the header file clustalv.h, you need just over 800k of memory to run
- the program. To reduce this, reduce MAXLEN; this is easily the
- biggest influence on memory usage. To compile the program and save
- it as an application you need to 'set the application type'; here
- you specify how much memory (in kilobytes (k)) the application will
- need. You should set this to 900k to run the application as it is
- OR reduce MAXLEN in the header. To compile the program you have to
- create a 'project'; you 'add' the names of the 9 source files to the
- project AND the name of the ANSI library. The source code is too
- large to compile in one compilation unit. You will get a 'link
- error: code segment too big' if you try to compile and link as is.
- You should compile amenu.c (the biggest source file) as a separate
- unit; you will have to read the manual, ask someone or mail me to
- find out what this is.
-
-
- *******************************************************************
-
- 3. Interactive usage.
-
-
-
- Interactive usage of Clustal V is completely menu driven. On-line
- help is provided, defaults are offered for all parameters and file
- names. With a little effort it should be completely self
- explanatory. The main menu, which appears when you run the
- program, is shown below. Each item brings you to a sub menu.
-
-
-
- Main menu for Clustal V:
-
-
- 1. Sequence Input From Disc
- 2. Multiple Alignments
- 3. Profile Alignments
- 4. Phylogenetic trees
-
- S. Execute a system command
- H. HELP
- X. EXIT (leave program)
-
-
- Your choice:
-
-
-
- The options S and H appear on all the main menus. H will provide
- help and if you type S you will be asked to enter a command, such as
- DIR or LS, which will be sent to the system (does not work on
- Mac's). Before carrying out an alignment, you must use option 1
- (sequence input); the format for sequences is explained below.
- Under menu item 2 you will be able to automatically align your
- sequences to each other. Menu item 3 allows you to do profile
- alignments. These are alignments of old alignments. This allows
- you to build up a multiple alignment in stages or add a new sequence
- to an old alignment. You can calculate phylogenetic trees from
- alignments using menu item 4.
-
-
-
-
- ******************************
- * SEQUENCE INPUT. *
- ******************************
-
-
- All sequences should be in 1 file. Three formats are automatically
- recognised and used: NBRF/PIR, EMBL/SwissProt and FASTA (Pearson and
- Lipman (1988) format).
-
- ***
- Users of the Wisconsin GCG package should use the command TONBRF
- (recently changed to TOPIR) to reformat their sequences before use.
- ***
-
- Sequences can be in upper or lower case. For proteins, the only
- symbols recognised are: A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y and
- for DNA/RNA use: A,C,G and T (or U). Any other letters of the
- alphabet will be treated as X (proteins) or N (DNA/RNA) for unknown.
- All other symbols (blanks, digits etc.) will be ignored EXCEPT for
- the hyphen "-" which can be used to specify a gap. This last point
- is especially useful for 2 reasons: 1) you can fix the positions of
- some gaps in advance; 2) the alignment output from this program can
- be written out in NBRF format using "-"'s to specify gaps; these
- alignments can be used again as input, either for profile alignments
- or for phylogenetic trees.
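-
- The symbol rules above can be sketched as a single function. This
- is a minimal illustration, not the code used in ClustalV: the name
- clean_symbol is invented here, and converting everything to upper
- case is an assumption made for simplicity.
-
```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Map one input character to the symbol kept for alignment.
   Returns 0 for characters that are skipped (blanks, digits etc.). */
int clean_symbol(int c, int is_dna)
{
    static const char protein[] = "ACDEFGHIKLMNPQRSTVWY";
    static const char nucleic[] = "ACGTU";

    assert(c >= 0 && c <= 255);
    if (c == '-')
        return '-';                       /* hyphen marks a gap       */
    if (!isalpha(c))
        return 0;                         /* ignored symbol           */
    c = toupper(c);                       /* assumption: store upper  */
    if (strchr(is_dna ? nucleic : protein, c) != NULL)
        return c;                         /* recognised symbol        */
    return is_dna ? 'N' : 'X';            /* any other letter         */
}
```
-
- For example, clean_symbol('B', 0) gives 'X' and clean_symbol('R', 1)
- gives 'N', while a digit or blank gives 0 and is dropped.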
-
- If you are using an editor to create sequence files, use the FASTA
- format as it is by far the simplest (see below). If you have access
- to utility programs for generating/converting the NBRF/PIR format
- then use it in preference.
-
-
-
- FASTA (PEARSON AND LIPMAN, 1988) FORMAT: The sequences are
- delimited by an angle bracket ">" in column 1. The text immediately
- after the ">" is used as a title. Everything on the following lines,
- until the next ">" or the end of the file, is one sequence.
-
- e.g.
-
- > RABSTOUT rabbit Guinness receptor
- LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS
- ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC
- > MUSNOSE mouse nose drying factor
- mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt
- fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfdv
- > HSHEAVEN human Guinness receptor repeat
- mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
- fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv
- mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
- fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv
-
-
-
- NBRF/PIR FORMAT is similar to FASTA format but immediately
- after the ">", you find the characters "P1;" if the sequences are
- protein or "DL;" if they are nucleic acid. Clustalv looks for the
- ";" character as the third character after the ">". If it finds one,
- it assumes that the format is NBRF; if not, FASTA format is assumed.
- The text after the ";" is treated as a sequence name while the
- entire next line is treated as a title. The sequence is terminated
- by a star "*" and the next sequence can then begin (with a >P1; etc.).
- This is just the basic format description (there are other
- variations and rules).
-
- ANY files/sequences in GCG format can be converted to this format
- using the TONBRF command (now TOPIR) of the Wisconsin GCG package.
-
-
- e.g.
-
- >P1;RABSTOUT
- rabbit Guinness receptor
- LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS
- ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC*
- >P1;MUSNOSE
- mouse nose drying factor
- mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt
- fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfd
- *
- >P1;HSHEAVEN
- human Guinness receptor repeat protein.
- mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
- fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv
- mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
- fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv*
-
-
-
-
- EMBL/SWISSPROT FORMAT: Do not try to create files with this
- format unless you have utilities to help. If you are just using an
- editor, use one of the above formats. If you do use this format,
- the program will ignore everything between the ID line (line
- beginning with the characters "ID") and the SQ line. The sequence
- is then read from between the SQ line and the "//" characters.
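-
- The recognition rules for the three formats can be sketched as
- follows. The function name guess_format is invented for this
- example; the numeric codes are the ones defined in clustalv.h. The
- real input routines may apply further checks that are not described
- in this document.
-
```c
#include <assert.h>
#include <string.h>

#define UNKNOWN   0   /* same codes as in clustalv.h */
#define EMBLSWISS 1
#define PIR       2
#define PEARSON   3

/* Guess the file format from the first line of a sequence entry. */
int guess_format(const char *line)
{
    assert(line != NULL);
    if (strncmp(line, "ID", 2) == 0)
        return EMBLSWISS;                 /* entry starts with an ID line */
    if (line[0] == '>') {
        /* ';' as the third character after '>' means NBRF/PIR */
        if (strlen(line) >= 4 && line[3] == ';')
            return PIR;
        return PEARSON;                   /* otherwise FASTA */
    }
    return UNKNOWN;
}
```
-
- With the example files above, ">P1;RABSTOUT" is classed as PIR and
- "> RABSTOUT rabbit Guinness receptor" as PEARSON.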
-
-
-
- It is critically important for the program to know whether or not it
- is aligning DNA or protein sequences. The input routines attempt to
- guess which type of sequence is being used by counting the number of
- A,C,G,T or U's in the sequences. If the total is more than 85% of
- the sequence length then DNA is assumed. If you use very bizarre
- sequences (proteins with really strange aa compositions or DNA
- sequences with loads of strange ambiguity codes) you might confuse
- the program. It is difficult to do but be careful.
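-
- The 85% rule can be sketched as below. The function name
- looks_like_dna is invented for this example, and counting only
- letters towards the sequence length (so that gaps and blanks do not
- affect the result) is an assumption.
-
```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 if more than 85% of the letters in seq are A, C, G, T or U. */
int looks_like_dna(const char *seq)
{
    long letters = 0, bases = 0;

    assert(seq != NULL);
    for (; *seq != '\0'; seq++) {
        int c = toupper((unsigned char)*seq);
        if (!isalpha(c))
            continue;                     /* skip gaps, blanks, digits */
        letters++;
        if (strchr("ACGTU", c) != NULL)
            bases++;
    }
    /* integer arithmetic avoids rounding problems at the threshold */
    return letters > 0 && bases * 100 > letters * 85;
}
```
-
- A DNA sequence in which more than 15% of the letters are ambiguity
- codes would be classed as protein by this test, which is the kind of
- confusion warned about above.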
-
-
-
-
-
- ******************************
- * MULTIPLE ALIGNMENT MENU. *
- ******************************
-
- The multiple alignment menu is shown below. Before explaining how
- to use it, you must be introduced briefly to the alignment strategy.
- If you do not follow this, try using option 1 anyway; the entire
- process will be carried out automatically.
-
- To do a complete multiple alignment, we need to know the approximate
- relationships of the sequences to each other (which ones are most
- similar to each other). We do this by calculating a crude
- phylogenetic tree which we call a dendrogram (to distinguish it from
- the more sensitive trees available under the phylogenetic tree
- menu). This dendrogram is used as a guide to align bigger and
- bigger groups of sequences during the multiple alignment. The
- dendrogram is calculated in 2 stages: 1) all pairs of sequences are
- compared using the fast/approximate method of Wilbur and Lipman
- (1983); the result of each comparison is a similarity score. 2) the
- similarity scores are used to construct the dendrogram using the
- UPGMA cluster analysis method of Sneath and Sokal (1973).
-
- The construction of the dendrogram can be very time consuming if you
- wish to align many sequences (e.g. for 100 sequences you need to
- carry out 100x99/2 sequence comparisons = 4950). During every
- multiple alignment, a dendrogram is constructed and saved to a file
- (something.dnd). These can be reused later.
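-
- The number of comparisons grows with the square of the number of
- sequences. A small helper (the name pair_count is invented for this
- example) makes the arithmetic explicit:
-
```c
#include <assert.h>

/* Number of pairwise comparisons needed for n sequences: n(n-1)/2. */
long pair_count(long n)
{
    assert(n >= 0);
    return n * (n - 1) / 2;
}
```
-
- For 100 sequences this gives 4950, and for the VAX/VMS limit of
- MAXN = 150 sequences it gives 11175.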
-
-
-
-
-
-
-
-
- ******Multiple*Alignment*Menu******
-
-
- 1. Do complete multiple alignment now
- 2. Produce dendrogram file only
- 3. Use old dendrogram file
- 4. Pairwise alignment parameters
- 5. Multiple alignment parameters
- 6. Output format options
-
- S. Execute a system command
- H. HELP
- or press [RETURN] to go back to main menu
-
-
- Your choice:
-
-
- So, if in doubt, and you have already loaded some sequences from the
- main menu, just try option 1 and press the <Return> key in response
- to any questions. You will be prompted for 2 file names e.g. if the
- sequence input file was called DRINK.PEP, you will be offered
- DRINK.ALN as the file to contain the alignment and DRINK.DND for the
- dendrogram.
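-
- One plausible way to derive such default names is sketched below.
- This document does not say how ClustalV builds them, so the function
- default_name and its behaviour are assumptions; the dirdelim argument
- plays the role of DIRDELIM from clustalv.h. VMS version numbers
- (";2") are not handled.
-
```c
#include <assert.h>
#include <string.h>

/* Build "<stem>.<ext>" from an input file name; returns 1 on success,
   0 if the result would not fit in out[cap]. */
int default_name(const char *in, const char *ext, char dirdelim,
                 char *out, size_t cap)
{
    const char *base, *dot;
    size_t stem;

    assert(in != NULL && ext != NULL && out != NULL);
    base = strrchr(in, dirdelim);          /* skip any directory part  */
    base = base ? base + 1 : in;
    dot = strrchr(base, '.');              /* last dot in the file name */
    stem = dot ? (size_t)(dot - in) : strlen(in);
    if (stem + 1 + strlen(ext) + 1 > cap)
        return 0;                          /* would overflow out[]     */
    memcpy(out, in, stem);
    out[stem] = '.';
    strcpy(out + stem + 1, ext);
    return 1;
}
```
-
- Searching for the dot only after the last directory delimiter keeps
- a dot inside a directory name from being mistaken for the start of
- the file extension.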
-
- If you wish to repeat a multiple alignment (e.g. to experiment with
- different gap penalties) but do not wish to make a dendrogram all
- over again, use menu item 3 (provided you are using the same
- sequences). Similarly, menu item 2 allows you to produce the
- dendrogram file only.
-
-
-
-
- PAIRWISE ALIGNMENT PARAMETERS:
-
- The parameters that control the initial fast/approximate comparisons
- can be set from menu item 4 which looks like:
-
-
- ********* WILBUR/LIPMAN PAIRWISE ALIGNMENT PARAMETERS *********
-
-
- 1. Toggle Scoring Method :Percentage
- 2. Gap Penalty :3
- 3. K-tuple :1
- 4. No. of top diagonals :5
- 5. Window size :5
-
- H. HELP
-
-
- Enter number (or [RETURN] to exit):
-
-
-
- The similarity scores are calculated from fast alignments generated
- by the method of Wilbur and Lipman (1983). These are 'hash' or
- 'word' or 'k-tuple' alignments carried out in 3 stages.
-
- First you mark the positions of every fragment of sequence, K-tuple
- long (for proteins, the default length is 1 residue, for DNA it is 2
- bases) in both sequences. Then you locate all k-tuple matches
- between the 2 sequences. At this stage you have to imagine a dot-
- matrix plot between the 2 sequences with each k-tuple match as a
- dot. You find those diagonals in the plot with most matches (you
- take the "No. of top diagonals" best ones) and mark all diagonals
- within "Window size" of each top diagonal. This process will define
- diagonal bands in the plot where you hope the most likely regions of
- similarity will lie.
-
- The final alignment stage is to find that head to tail arrangement
- of k-tuple matches from these diagonal regions that will give the
- highest score. The score is calculated as the number of exactly
- matching residues in this alignment minus a "gap penalty" for every
- gap that was introduced. When you toggle "Scoring method" you
- choose between expressing these similarity scores as raw scores or
- as percentages of the shorter sequence length.
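-
- The idea of counting k-tuple matches along the diagonals of the
- imaginary dot-matrix plot can be sketched as below. This is a naive
- scan that compares every pair of positions; it does not use the
- 'hash' lookup of Wilbur and Lipman and it reports only the single
- best diagonal, so it illustrates the principle rather than the
- routine in ClustalV. The function name best_diagonal is invented.
-
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns the diagonal (i - j) holding the most k-tuple matches between
   a and b, and stores that match count in *best_count.  Returns 0 with
   *best_count = 0 if there is no match or the input is too short. */
int best_diagonal(const char *a, const char *b, int k, int *best_count)
{
    int n, m, na, nb, i, j, d, best = 0;
    int *count;

    assert(a != NULL && b != NULL && best_count != NULL);
    *best_count = 0;
    n = (int)strlen(a);
    m = (int)strlen(b);
    if (k < 1 || n < k || m < k)
        return 0;
    na = n - k + 1;                        /* number of k-tuples in a */
    nb = m - k + 1;                        /* number of k-tuples in b */
    count = calloc((size_t)(na + nb - 1), sizeof *count);
    if (count == NULL)
        return 0;
    for (i = 0; i < na; i++)
        for (j = 0; j < nb; j++)
            if (strncmp(a + i, b + j, (size_t)k) == 0)
                count[i - j + nb - 1]++;   /* shift so the index is >= 0 */
    for (d = 0; d < na + nb - 1; d++)
        if (count[d] > *best_count) {
            *best_count = count[d];
            best = d - (nb - 1);           /* undo the shift */
        }
    free(count);
    return best;
}
```
-
- A larger k gives fewer chance matches and so fewer dots to consider,
- which is why increasing the K-tuple size increases speed at the cost
- of sensitivity.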
-
- K-TUPLE SIZE: Can be 1 or 2 for proteins; 1 to 4 for DNA.
- Increase this to increase speed; decrease to improve sensitivity.
-
- GAP PENALTY: The number of matching residues that must be found
- in order to introduce a gap. This should be larger than K-Tuple
- Size. This has little effect on speed or sensitivity.
-
- NO. OF TOP DIAGONALS: The number of best diagonals in the
- imaginary dot-matrix plot that are considered. Decrease (must be
- greater than zero) to increase speed; increase to improve
- sensitivity.
-
- WINDOW SIZE: The number of diagonals around each "top" diagonal
- that are considered. Decrease for speed; increase for greater
- sensitivity.
-
- SCORING METHOD: The similarity scores may be expressed as raw scores
- (number of identical residues minus a "gap penalty" for each gap) or
- as percentage scores. If the sequences are of very different
- lengths, percentage scores make more sense.
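-
- The two ways of expressing a score can be written out as below. The
- function names raw_score and percent_score are invented for this
- example.
-
```c
#include <assert.h>

/* Raw score: identical residues minus a penalty for every gap. */
int raw_score(int identities, int gaps, int gap_penalty)
{
    assert(identities >= 0 && gaps >= 0 && gap_penalty >= 0);
    return identities - gaps * gap_penalty;
}

/* The same score as a percentage of the shorter sequence length. */
double percent_score(int raw, int len1, int len2)
{
    int shorter = (len1 < len2) ? len1 : len2;

    assert(shorter > 0);
    return 100.0 * raw / shorter;
}
```
-
- With 50 identities, 2 gaps and the default gap penalty of 3, the raw
- score is 44; for sequences of length 80 and 100 this is 55%.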
-
-
-
- CHANGING THE PAIRWISE ALIGNMENT PARAMETERS
-
- The main reason for wanting to change the above parameters is SPEED
- (especially on microcomputers), NOT SENSITIVITY. The dendrograms
- that are produced can only show the relationships between the
- sequences APPROXIMATELY because the similarity scores are calculated
- from separate pairwise alignments, not from a multiple alignment
- (that is what we eventually hope to produce). If the groupings of
- the sequences are "obvious", the above method should work well; if
- the relationships are obscure or weakly represented by the data, it
- will not make much difference playing with the parameters. The main
- factor influencing speed is the K-TUPLE SIZE followed by the WINDOW
- SIZE.
-
- The alignments are carried out in a small amount of memory.
- Occasionally (it is hard to predict), you will run out of memory
- while doing these alignments; when this happens, it will say on the
- screen: "Sequences (a,b) partially aligned" (instead of "Sequences
- (a,b) aligned"). This means that the alignment score for these
- sequences will be approximate; it is not a problem unless many of
- the alignments do this. It can be fixed by using less sensitive
- parameters or increasing parameter FSIZE in clustalv.h .
-
-
- THE DENDROGRAM ITSELF
-
- The similarity scores generated by the fast comparison of all the
- sequences are used to construct a dendrogram by the UPGMA method of
- Sneath and Sokal (1973). This is a form of cluster analysis and the
- end result produces something that looks like a tree. It represents
- the similarity of the sequences as a hierarchy. The dendrogram is
- written to a file in a machine readable format and is shown below
- for an example with 6 sequences.
-
-
- 91.0 0 0 2 012000 ! seq 2 joins seq 3 at 91% ID.
- 72.0 1 0 3 011200 ! seq 4 joins seqs 2,3 at 72%
- 71.1 0 0 2 000012 ! seq 5 joins seq 6 at 71%
- 35.5 0 2 4 122200 ! seq 1 joins seqs 2,3,4
- 21.7 4 3 6 111122 ! seqs 1,2,3,4 join seqs 5,6
-
- This LOOKS complicated but you do not normally need to care what is
- in here. Anyway, each row represents the joining together of 2 or
- more sequences. You progress from the top down, joining more and
- more sequences until all are joined together; for N sequences you
- have N-1 groupings hence there are 5 rows in the above file (there
- were 6 sequences). In each row, the first number is the similarity
- score of this grouping; ignore the next three columns for the
- moment; the last 6 digits in the line show which sequences are
- grouped; there is one digit for each sequence (the first digit is
- for the first sequence). The rule is: in each row, all of the "1"s
- join all of the "2"s; the zeros do nothing.
-
- Hence, in the first row, sequence 2 joins sequence 3 at a similarity
- level of 91% identity; next, sequence 4 joins the previous grouping
- of 2 plus 3 at a level of 72% etc. This is shown diagrammatically
- below. Before leaving the dendrogram format, the other 3 columns of
- numbers are: a pointer to the row from which the "1" sequences were
- last joined (or zero if only one of them); a pointer to the row in
- which the "2"s were last joined; the total number of sequences
- joined in this line.
-
-
-
-
-                    I------ 2
-             I------I
-             I      I------ 3
-        I----I
-        I    I------------- 4
-     I--I
-     I  I------------------ 1
- ----I
-     I       I------------- 5
-     I-------I
-             I------------- 6
-
- Diagram of the sequence similarity relationships shown in the above
- dendrogram file (branch lengths are not to scale).
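-
- Reading one row of the dendrogram file can be sketched as below,
- assuming the columns are separated by blanks as in the example rows
- above. The struct and function names are invented for this example
- and are not taken from the ClustalV source.
-
```c
#include <assert.h>
#include <stdio.h>

#define MAXN 150   /* as in the VAX/VMS block of clustalv.h */

struct dnd_row {
    double score;           /* similarity score of this grouping      */
    int ptr1;               /* row where the "1"s were last joined    */
    int ptr2;               /* row where the "2"s were last joined    */
    int njoined;            /* total number of sequences joined       */
    char groups[MAXN + 1];  /* one digit (0, 1 or 2) per sequence     */
};

/* Parse one row; returns 1 if all five columns were read.
   The width 150 in the format string must match MAXN. */
int parse_dnd_row(const char *line, struct dnd_row *row)
{
    assert(line != NULL && row != NULL);
    return sscanf(line, "%lf %d %d %d %150s", &row->score, &row->ptr1,
                  &row->ptr2, &row->njoined, row->groups) == 5;
}

/* Count the sequences marked with the given digit ('1' or '2'). */
int count_group(const struct dnd_row *row, char digit)
{
    int n = 0;
    const char *p;

    assert(row != NULL);
    for (p = row->groups; *p != '\0'; p++)
        if (*p == digit)
            n++;
    return n;
}
```
-
- For the second example row, the "1"s are sequences 2 and 3, the "2"
- is sequence 4, and the two counts add up to the 3 sequences recorded
- in the fourth column.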
-
-
-
-
-
-
-
-
-
- MULTIPLE ALIGNMENT PARAMETERS:
-
-
- Having calculated a dendrogram between a set of sequences, the final
- multiple alignment is carried out by a series of alignments of
- larger and larger groups of sequences. The order is determined by
- the dendrogram so that the most similar sequences get aligned first.
- Any gaps that are introduced in the early alignments are fixed.
- When two groups of sequences are aligned against each other, a full
- protein weight matrix (such as a Dayhoff PAM 250) is used. Two gap
- penalties are offered: a "FIXED" penalty for opening up a gap and a
- "FLOATING" penalty for extending a gap.
-
-
- ********* MULTIPLE ALIGNMENT PARAMETERS *********
-
-
- 1. Fixed Gap Penalty :10
- 2. Floating Gap Penalty :10
- 3. Toggle Transitions (DNA):Weighted
- 4. Protein weight matrix :PAM 250
-
- H. HELP
-
-
- Enter number (or [RETURN] to exit):
-
-
- FIXED GAP PENALTY: Reduce this to encourage gaps of all sizes;
- increase it to discourage them. Terminal gaps are penalised the
- same as all others. BEWARE of making this too small (approx 5 or so); if
- the penalty is too small, the program may prefer to align each
- sequence opposite one long gap.
-
- FLOATING GAP PENALTY: Reduce this to encourage longer gaps;
- increase it to shorten them. Terminal gaps are penalised the same as
- all others. BEWARE of making this too small (approx 5 or so); if
- the penalty is too small, the program may prefer to align each
- sequence opposite one long gap.
-
-
- DNA TRANSITIONS = WEIGHTED or UNWEIGHTED: By default, transitions
- (A versus G; C versus T) are weighted more strongly than
- transversions (an A aligned with a G will be preferred to an A
- aligned with a C or a T). You can make all pairs of nucleotides
- equally weighted with this option.
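-
- The distinction between transitions and transversions can be
- sketched as below. The function name is_transition is invented for
- this example; the actual weights that ClustalV gives to each kind of
- pair are not stated here, so none are shown.
-
```c
#include <assert.h>
#include <ctype.h>

static int is_purine(int b)     { return b == 'A' || b == 'G'; }
static int is_pyrimidine(int b) { return b == 'C' || b == 'T'; }

/* Returns 1 if aligning a with b is a transition (A<->G or C<->T),
   0 for identical bases, transversions or unknown symbols. */
int is_transition(int a, int b)
{
    assert(a >= 0 && a <= 255 && b >= 0 && b <= 255);
    a = toupper(a);
    b = toupper(b);
    if (a == 'U') a = 'T';                /* treat U as T */
    if (b == 'U') b = 'T';
    if (a == b)
        return 0;                         /* identical, not a change */
    return (is_purine(a) && is_purine(b)) ||
           (is_pyrimidine(a) && is_pyrimidine(b));
}
```
-
- With WEIGHTED transitions, a pair for which this test is true is
- preferred over a transversion; with UNWEIGHTED, both kinds of
- mismatch are treated alike.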
-
- PROTEIN WEIGHT MATRIX: For protein comparisons, a weight matrix is
- used to differentially weight different pairs of aligned amino
- acids. The default is the well known Dayhoff PAM 250 matrix. We
- also offer a PAM 100 matrix, an identity matrix (all weights are the
- same for exact matches) or allow you to give the name of a file with
- your own matrix. The weight matrices used by Clustal V are shown in
- full in the Algorithms and References section of this documentation.
-
- If you input a matrix from a file, it must be in the following
- format. Use a 20x20 matrix only (entries for the 20 "normal" amino
- acids only; no ambiguity codes etc.). Input the lower left triangle
- of the matrix, INCLUDING the diagonal. The order of the amino acids
- (rows and columns) must be: CSTPAGNDEQHRKMILVFYW. The values can be
- in free format separated by spaces (not commas). The PAM 250 matrix
- is shown below in this format.
-
- 12
- 0 2
- -2 1 3
- -3 1 0 6
- -2 1 1 1 2
- -3 1 0 -1 1 5
- -4 1 0 -1 0 0 2
- -5 0 0 -1 0 1 2 4
- -5 0 0 -1 0 0 1 3 4
- -5 -1 -1 0 0 -1 1 2 2 4
- -3 -1 -1 0 -1 -2 2 1 1 3 6
- -4 0 -1 0 -2 -3 0 -1 -1 1 2 6
- -5 0 0 -1 -1 -2 1 0 0 1 0 3 5
- -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6
- -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5
- -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6
- -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4
- -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9
- 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10
- -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17
-
- Values must be integers and can be all positive or positive and
- negative as above. These are SIMILARITY values.
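-
- Reading a matrix in this format can be sketched as below: 210
- integers are taken in order (row 1 has 1 value, row 20 has 20) and
- mirrored into a full symmetric 20x20 table. The names aa_index and
- parse_triangle are invented for this example, and reading from a
- string rather than a file is a simplification.
-
```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define NAA 20
static const char aa_order[] = "CSTPAGNDEQHRKMILVFYW";

/* Index of an amino acid in the required row/column order, or -1. */
int aa_index(char aa)
{
    const char *p;

    if (aa == '\0')
        return -1;                        /* strchr would match the NUL */
    p = strchr(aa_order, toupper((unsigned char)aa));
    return p ? (int)(p - aa_order) : -1;
}

/* Parse the lower-left triangle (including the diagonal) from text and
   fill both halves of m.  Returns 1 on success, 0 if values run out. */
int parse_triangle(const char *text, int m[NAA][NAA])
{
    int row, col, value, used;

    assert(text != NULL && m != NULL);
    for (row = 0; row < NAA; row++)
        for (col = 0; col <= row; col++) {
            /* %n records how many characters this value consumed */
            if (sscanf(text, "%d%n", &value, &used) != 1)
                return 0;
            text += used;
            m[row][col] = value;
            m[col][row] = value;          /* mirror across the diagonal */
        }
    return 1;
}
```
-
- After parsing the PAM 250 values shown above, the score for aligning
- C with W would be found at m[aa_index('C')][aa_index('W')], which is
- the first value on the last row (-8).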
-
-
-
-
- ALIGNMENT OUTPUT OPTIONS:
-
- By default, the alignment goes to a file in a self explanatory
- "blocked" alignment format. This format is fine for displaying the
- results but requires heavy editing if you wish to use the alignment
- with other software. To help, we provide 3 other formats which can
- be turned on or off. If you have a sequence data set or alignment
- in memory, you can also ask for output files in whatever formats are
- turned on, NOW. The menu you use to choose format is shown below.
-
- ***
- We draw your attention to NBRF/PIR format in particular. This
- format is EXACTLY the same as one of the input formats. Therefore,
- alignments written in this format can be used again as input (to the
- profile alignments or phylogenetic trees).
- ***
-
-
- ********* Format of Alignment Output *********
-
-
- 1. Toggle CLUSTAL format output = ON
- 2. Toggle NBRF/PIR format output = OFF
- 3. Toggle GCG format output = OFF
- 4. Toggle PHYLIP format output = OFF
-
- 5. Create alignment output file(s) now?
- H. HELP
-
-
- Enter number (or [RETURN] to exit):
-
-
-
- CLUSTAL FORMAT: This is a self explanatory alignment. The
- alignment is written out in blocks. Identities are highlighted and
- (if you use a PAM 250 matrix) positions in the alignment where all
- of the residues are "similar" to each other (PAM 250 score of 8 or
- more) are indicated.
-
- NBRF/PIR FORMAT: This is the usual NBRF/PIR format with gaps
- indicated by hyphens ("-"). As we have stressed before, this format
- is EXACTLY compatible with the sequence input format. Therefore you
- can read in these alignments again for profile alignments or for
- calculating phylogenetic trees.
-
- GCG FORMAT: In version 7 of the Wisconsin GCG package, a new
- multiple sequence format was introduced. This is the MSF (Multiple
- Sequence Format). It can be used as input to the GCG
- sequence editor or any of the GCG programs that make use of multiple
- alignments. THIS FORMAT IS ONLY SUPPORTED IN VERSION 7 OF THE GCG
- PACKAGE OR LATER.
-
- PHYLIP FORMAT: This format can be used by the Phylip package of
- Joe Felsenstein (see the references/algorithms section for details
- of how to get it). Phylip allows you to do a huge range of
- phylogenetic analyses (we just offer one method in this program) and
- is probably the most widely used set of programs for drawing trees.
- It also works on just about every computer you can think of,
- providing you have a decent Pascal compiler.
-
-
-
-
-
- ******************************
- * PROFILE ALIGNMENT MENU. *
- ******************************
-
-
-
- This menu is for taking two old alignments (or single sequences) and
- aligning them with each other. The result is one bigger alignment.
- The menu is very similar to the multiple alignment menu except that
- there is no mention of dendrograms here (they are not needed) and
- you need to input two sets of sequences. The menu looks like this:
-
-
-
- ******Profile*Alignment*Menu******
-
-
- 1. Input 1st. profile/sequence
- 2. Input 2nd. profile/sequence
- 3. Do alignment now
- 4. Alignment parameters
- 5. Output format options
-
- S. Execute a system command
- H. HELP
- or press [RETURN] to go back to main menu
-
-
- Your choice:
-
-
- You must input profile number 1 first. When both profiles are
- loaded, use item 3 (Do alignment now) and the 2 profiles will be
- aligned. Items 4 and 5 (parameters and output options) are
- identical to the equivalent options on the multiple alignment menu.
-
- The same input routines that are used for general input are used
- here i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA
- format, with gaps indicated by hyphens ("-"). This is why we have
- continually drawn your attention to the NBRF/PIR format as a useful
- output format.
-
- Either profile can consist of just one sequence. Therefore, if you
- have a favourite alignment of sequences that you are working on and
- wish to add a new sequence, you can use this menu, provided the
- alignment is in the correct format.
-
- The total number of sequences in the two profiles must be less
- than or equal to the MAXN parameter set in the clustalv.h header
- file.
-
-
-
-
-
-
-
-
-
-
-
- ******************************
- * PHYLOGENETIC TREE MENU. *
- ******************************
-
-
- This menu allows you to input an alignment and calculate a
- phylogenetic tree. You can also calculate a tree if you have just
- carried out a multiple alignment and the alignment is still in
- memory. THE SEQUENCES MUST BE ALIGNED ALREADY!!!!!! The tree will
- look strange if the sequences are not already aligned. You can also
- "BOOTSTRAP" the tree to show confidence levels for groupings. This
- is SLOW on microcomputers but works fine on workstations or
- mainframes.
-
-
-
- ******Phylogenetic*tree*Menu******
-
-
- 1. Input an alignment
- 2. Exclude positions with gaps? = OFF
- 3. Correct for multiple substitutions? = OFF
- 4. Draw tree now
- 5. Bootstrap tree
-
- S. Execute a system command
- H. HELP
- or press [RETURN] to go back to main menu
-
-
- Your choice:
-
-
-
-
- The same input routine that is used for general input is used here
- i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA format,
- with gaps indicated by hyphens ("-"). This is why we have
- continually drawn your attention to the NBRF/PIR format as a useful
- output format.
-
- If you have input an alignment, then just use item 4 to draw a tree.
- The method used is the Neighbor Joining method of Saitou and Nei
- (1987). This is a "distance method". First, percent divergence
- figures are calculated between all pairs of sequences. These
- divergence figures are then used by the NJ method to give the tree.
- Example trees will be shown below.
-
- There are two options which can be used to control the way the
- distances are calculated. These are set by options 2 and 3 in the
- menu.
-
- EXCLUDE POSITIONS WITH GAPS? This option allows you to ignore all
- alignment positions (columns) where there is a gap in ANY sequence.
- This guarantees that "like" is compared with "like" in all distances
- i.e. the same positions are used to calculate all distances. It
- also means that the distances will be "metric". The disadvantage of
- using this option is that you throw away much of the data if there
- are many gaps. If the total number of gaps is small, it has little
- effect.
-
- CORRECT FOR MULTIPLE SUBSTITUTIONS? As sequences diverge,
- substitutions accumulate. It becomes increasingly likely that more
- than one substitution (as a result of a mutation) will have happened
- at a site where you observe just one difference now. This option
- allows you to use formulae developed by Motoo Kimura to correct for
- this effect. It has the effect of stretching long branches in trees
- while leaving short ones relatively untouched. The desired effect
- is to try and make distances proportional to time since divergence.
-
- The tree is sent to a file called BLAH.NJ, where BLAH.SEQ is the
- name of the input alignment file. An example is shown below for 6
- globin sequences.
-
-
-
- DIST = percentage divergence (/100)
- Length = number of sites used in comparison
-
- 1 vs. 2 DIST = 0.5683; length = 139
- 1 vs. 3 DIST = 0.5540; length = 139
- 1 vs. 4 DIST = 0.5315; length = 111
- 1 vs. 5 DIST = 0.7447; length = 141
- 1 vs. 6 DIST = 0.7571; length = 140
- 2 vs. 3 DIST = 0.0897; length = 145
- 2 vs. 4 DIST = 0.1391; length = 115
- 2 vs. 5 DIST = 0.7517; length = 145
- 2 vs. 6 DIST = 0.7431; length = 144
- 3 vs. 4 DIST = 0.0957; length = 115
- 3 vs. 5 DIST = 0.7379; length = 145
- 3 vs. 6 DIST = 0.7361; length = 144
- 4 vs. 5 DIST = 0.7304; length = 115
- 4 vs. 6 DIST = 0.7368; length = 114
- 5 vs. 6 DIST = 0.2697; length = 152
-
-
- Neighbor-joining Method
-
- Saitou, N. and Nei, M. (1987) The Neighbor-joining Method:
- A New Method for Reconstructing Phylogenetic Trees.
- Mol. Biol. Evol., 4(4), 406-425
-
-
- This is an UNROOTED tree
-
- Numbers in parentheses are branch lengths
-
-
- Cycle 1 = SEQ: 5 ( 0.13382) joins SEQ: 6 ( 0.13592)
-
- Cycle 2 = SEQ: 1 ( 0.28142) joins Node: 5 ( 0.33462)
-
- Cycle 3 = SEQ: 2 ( 0.05879) joins SEQ: 3 ( 0.03086)
-
- Cycle 4 (Last cycle, trichotomy):
-
- Node: 1 ( 0.20798) joins
- Node: 2 ( 0.02341) joins
- SEQ: 4 ( 0.04915)
-
-
-
- The output file first shows the percent divergence (distance)
- figures between each pair of sequences. Then a description of a NJ
- tree is given. This description shows which sequences (SEQ:) or
- which groups of sequences (NODE: , a node is numbered using the
- lowest sequence that belongs to it) join at each level of the tree.
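
If you want to reuse the pairwise distances from this file in your own scripts, the lines of the form "1 vs. 2 DIST = 0.5683; length = 139" are easy to pick out. The following is a small Python sketch, not part of ClustalV. It assumes the file layout is exactly as in the example above; the names PAIR_LINE and parse_distances are invented for this illustration.

```python
import re

# Pattern for one line of the distance listing, as laid out in the
# example output above (assumed layout, not taken from the C source).
PAIR_LINE = re.compile(
    r"(\d+)\s+vs\.\s+(\d+)\s+DIST\s*=\s*([\d.]+);\s*length\s*=\s*(\d+)"
)

def parse_distances(text):
    """Map (i, j) -> (distance, number of sites compared)."""
    dist = {}
    for m in PAIR_LINE.finditer(text):
        i, j = int(m.group(1)), int(m.group(2))
        dist[(i, j)] = (float(m.group(3)), int(m.group(4)))
    return dist

example = """
 1 vs. 2 DIST = 0.5683; length = 139
 2 vs. 3 DIST = 0.0897; length = 145
 5 vs. 6 DIST = 0.2697; length = 152
"""
d = parse_distances(example)
print(d[(2, 3)])
```

Lines that do not match the pattern (headings, the tree description) are simply skipped.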
-
- This is an unrooted tree!! This means that the direction of
- evolution through the tree is not shown. This can only be inferred
- in one of two ways:
- 1) assume a degree of constancy in the molecular clock and place the
- root (bottom of the tree; the point where all the sequences radiate
- from) half way along the longest branch. **OR**
- 2) use an "outgroup", a sequence from an organism that you "know"
- must be outside of the rest of the sequences i.e. root the tree
- manually, on biological grounds.
-
- The above tree can be represented diagrammatically as follows:
-
-
- SEQ 1 SEQ 4
- I I
- 13.6 I 28.1 I 4.9 5.9
- SEQ 6 ----------I I I I--------- SEQ 2
- I I I I
- I--------I-----------I----------I
- 13.4 I 33.5 20.8 2.3 I 3.1
- SEQ 5 ----------I I--------- SEQ 3
-
-
- The figures along each branch are percent divergences along that
- branch. If you root the tree by placing the root along the longest
- branch (33.5%) then you can draw it again as follows, this time
- rooted:
-
-
-
- 13.6
- I-------------------- SEQ 6
- I---------I 13.4
- I I-------------------- SEQ 5
- I 33.5
- -----I 28.1
- I I-------------------- SEQ 1
- I I
- I---------I 4.9
- I 20.8 I----------- SEQ 4
- I--------I
- I 5.9
- I 2.3 I----- SEQ 2
- I-----I 3.1
- I----- SEQ 3
-
-
-
- The longest branch (33.5% between 5,6 and 1,2,3,4) is split between
- the 2 bottom branches of the tree. As it happens in this particular
- case, sequences 5 and 6 are myoglobins while sequences 1,2,3 and 4
- are alpha and beta globins, so you could also justify the above
- rooting on biological grounds. If you do not have any particular
- need or evidence for the position of the root, then LEAVE THE TREE
- UNROOTED. Unrooted trees do not look as pretty as rooted ones but
- it is usual to leave them unrooted if you do not have any evidence
- for the position of the root.
-
-
- BOOTSTRAPPING: Different sets of sequences and different tree
- drawing methods may give different topologies (branching orders) for
- parts of a tree that are weakly supported by the data. It is useful
- to have an indication of the degree of error in the tree. There are
- several ways of doing this, some of them rather technical. We
- provide one general purpose method in this program, which makes use
- of a technique called bootstrapping (see Felsenstein, 1985).
-
- In the case of sequence alignments, bootstrapping involves taking
- random samples of positions from the alignment. If the alignment
- has N positions, each bootstrap sample consists of a random sample
- of N positions, taken WITH REPLACEMENT i.e. in any given sample,
- some sites may be sampled several times, others not at all. Then,
- with each sample of sites, you calculate a distance matrix as usual
- and draw a tree. If the data very strongly support just one tree
- then the sample trees will be very similar to each other and to the
- original tree, drawn without bootstrapping. However, if parts of
- the tree are not well supported, then the sample trees will vary
- considerably in how they represent these parts.
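
The resampling step described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the code used inside ClustalV: the function name bootstrap_columns and the toy alignment are made up for the example, and Python's random module stands in for the program's random number generator.

```python
import random

def bootstrap_columns(alignment, rng):
    """Return one bootstrap sample of an alignment.

    N column indices are drawn WITH REPLACEMENT, where N is the
    alignment length, and the same columns are taken from every
    sequence so that the sample is still an alignment.
    """
    n = len(alignment[0])
    picks = [rng.randrange(n) for _ in range(n)]
    return ["".join(seq[i] for i in picks) for seq in alignment]

# Toy alignment: 3 sequences, 8 positions (hypothetical data).
alignment = ["ACGT-ACG", "ACGTTACG", "ACCT-ACG"]

rng = random.Random(99)   # the same seed gives the same samples
sample = bootstrap_columns(alignment, rng)
for seq in sample:
    print(seq)
```

Each sample has the same number of sequences and positions as the original alignment; some original columns appear several times and others not at all. A distance matrix and tree would then be calculated from each sample in the usual way.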
-
- In practice, you should use a very large number of bootstrap
- replicates (1000 is recommended, even if it means running the
- program for an hour on a slow microcomputer; on a workstation it
- will be MUCH faster). For each grouping on the tree, you record the
- number of times this grouping occurs in the sample trees. For a
- group to be considered "significant" at the 95% level (or P <= 0.05
- in statistical terms) you expect the grouping to show up in >= 95%
- of the sample trees. If this happens, then you can say that the
- grouping is significant, given the data set and the method used to
- draw the tree.
-
- So, when you use the bootstrap option, a NJ tree is drawn as before
- and then you are asked to say how many bootstrap samples you want
- (1000 is the default) and you are asked to give a seed number for
- the random number generator. If you give the same seed number in
- future, you will get the same results (we hope). Remember to give
- different seed numbers if you wish to carry out genuinely different
- bootstrap sampling experiments. Below is the output file from using
- the same data for the 6 globin sequences as used before. The output
- file has the same name as the input file with the extension ".njb".
-
- //
- STUFF DELETED .... same as for the ordinary NJ output
- //
- Bootstrap Confidence Limits
-
-
- Random number generator seed = 99
-
- Number of bootstrap trials = 1000
-
-
- Diagrammatic representation of the above tree:
-
- Each row represents 1 tree cycle; defining 2 groups.
-
- Each column is 1 sequence; the stars in each line show 1 group;
- the dots show the other
-
- Numbers show occurences in bootstrap samples.
-
- ****.. 1000
- .***.. 1000 <- This is the answer!!
- *..*** 812
- 122311
-
-
- For an unrooted tree with N sequences, there are actually only N-3
- genuinely different groupings that we can test (this is the number
- of "internal branches"; each internal branch splits the sequences
- into 2 groups). In this example, we have 6 sequences with 3
- internal branches in the reference tree. In the bootstrap
- resampling, we count how often each of these internal branches
- occurs. Here, we find that the branch which splits 1,2,3 and 4
- versus 5 and 6 occurs in all 1000 samples; the branch which splits
- 2,3 and 4 versus 1,5 and 6 occurs in 1000; the branch which splits 2
- and 3 versus 1,4,5 and 6 occurs in 812/1000 samples. We can put
- these figures on to the diagrammatic representation we made earlier
- of our unrooted NJ tree as follows:
-
-
-
- SEQ 1 SEQ 4
- I I
- I I
- SEQ 6 ----------I I I I--------- SEQ 2
- I 1000 I 1000 I 812 I
- I--------I-----------I----------I
- I I
- SEQ 5 ----------I I--------- SEQ 3
-
-
-
- You can equally put these confidence figures on the rooted tree (in
- fact the interpretation is simpler with rooted trees). With the
- unrooted tree, the grouping of sequence 5 with 6 is significant (as
- is the grouping of sequences 1,2,3 and 4). Equally the grouping of
- sequences 1,5 and 6 is significant (the same as saying that 2,3 and
- 4 group significantly). However, the grouping of 2 and 3 is not
- significant, although it is relatively strongly supported.
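
The rows of stars and dots in the bootstrap output can be decoded mechanically. The Python sketch below is our own illustration (decode_row is a hypothetical helper, not something supplied with ClustalV); it assumes each row consists of one star or dot per sequence followed by the count, as in the example above, and it reports whether the count reaches 95% of the trials.

```python
def decode_row(row):
    """Split a row such as '****.. 1000' into its two groups and count.

    Sequences are numbered from 1, left to right, one column each.
    """
    pattern, count = row.split()
    stars = [i + 1 for i, c in enumerate(pattern) if c == "*"]
    dots = [i + 1 for i, c in enumerate(pattern) if c == "."]
    return stars, dots, int(count)

rows = ["****.. 1000", ".***.. 1000", "*..*** 812"]
trials = 1000

for row in rows:
    stars, dots, count = decode_row(row)
    level = "at or above 95%" if count >= 0.95 * trials else "below 95%"
    print(stars, "versus", dots, ":", count, "of", trials, level)
```

For the example above this gives the groupings 1,2,3,4 versus 5,6; then 2,3,4 versus 1,5,6; then 1,4,5,6 versus 2,3, with only the last falling below the 95% level.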
-
- Unfortunately, there is a small complication in the interpretation
- of these results. In statistical hypothesis testing, it is not
- valid to make multiple simultaneous tests and to treat the result of
- each test completely independently. In the above case, if you have
- one particular test (grouping) that you wish to make in advance, it
- is valid to test IT ALONE and to simply show the other bootstrap
- figures for reference. If you do not have any particular test in
- mind before you do the bootstrapping, you can just show all of the
- figures and use the 95% level as an ARBITRARY cut off to show those
- groups that are very strongly supported; but not mention anything
- about SIGNIFICANCE testing. In the literature, it is common
- practice to simply show the figures with a tree; they frequently
- speak for themselves.
-
-
-
- *******************************************************************
-
- 4. Command Line Interface.
-
-
-
- You can do almost everything that can be done from the menus, using
- a command line interface. In this mode, the program will take all of
- its instructions as "switches" when you activate it; no questions
- will be asked; if there are no errors, the program just does an
- analysis and stops. It does not work so well on the MAC but is
- still possible. To get you started we will show you the 2 simplest
- uses of the command line as it looks on VAX/VMS. On all other
- machines (except the MAC) it works in the same way.
-
- $ clustalv /help **OR** $ clustalv /check
-
- Both of the above switches give you a one page summary of the
- command line on the screen and then the program stops.
-
-
- $ clustalv proteins.seq **OR** $ clustalv /infile=proteins.seq
-
- This will read the sequences from the file 'proteins.seq' and do a
- complete multiple alignment. Default parameters will be used, the
- program will try to tell whether or not the sequences are DNA or
- protein and the output will go to a file called 'proteins.aln' . A
- dendrogram file called 'proteins.dnd' will also be created. Thus
- the default action for the program, when it successfully reads in an
- input file is to do a full multiple alignment. Some further
- examples of command line usage will be given later.
-
- Command line switches can be abbreviated but MAKE SURE YOU DO NOT
- MAKE THEM AMBIGUOUS. No attempt will be made to detect ambiguity.
- Use enough characters to distinguish each switch uniquely.
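
Since the program itself will not warn you about an ambiguous abbreviation, it can be worth checking one before using it in a script. The Python sketch below is our own helper, not part of ClustalV. It assumes only that an abbreviation is a leading part of the switch name, and the list of names is copied from the switches documented in this section.

```python
# Switch names as listed in this section of the documentation.
SWITCHES = [
    "INFILE", "PROFILE1", "PROFILE2", "HELP", "CHECK", "ALIGN", "TREE",
    "BOOTSTRAP", "KTUP", "TOPDIAGS", "WINDOW", "PAIRGAP", "FIXEDGAP",
    "FLOATGAP", "MATRIX", "TYPE", "OUTPUT", "TRANSIT", "KIMURA",
    "TOSSGAPS", "SEED",
]

def matches(abbrev):
    """Return every switch name that starts with the abbreviation."""
    a = abbrev.upper()
    return [s for s in SWITCHES if s.startswith(a)]

def is_unambiguous(abbrev):
    """True if exactly one switch starts with the abbreviation."""
    return len(matches(abbrev)) == 1

for abbrev in ["boot", "fixed", "to", "t"]:
    print(abbrev, matches(abbrev), is_unambiguous(abbrev))
```

Here "boot" and "fixed" each pick out a single switch, while "to" and "t" match several and should be avoided.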
-
-
-
-
-
-
-
- The full list of allowed switches is given below:
-
-
- DATA (sequences)
-
- /INFILE=file.ext :input sequences. If you give an input file and
- nothing else as a switch, the default action is
- to do a complete multiple alignment. The input
- file can also be specified by giving it as the
- first command line parameter with no "/" in
- front of it e.g. $ clustalv file.ext .
-
- /PROFILE1=file.ext :You use these two switches to give the names of
- /PROFILE2=file.ext two profiles. The default action is to align
- the two. You must give the names of both profile
- files.
-
-
-
- VERBS (do things)
-
- /HELP :list the command line parameters on the screen.
- /CHECK
-
- /ALIGN :do full multiple alignment. This is the default
- action if no other switches except for input files
- are given.
-
- /TREE :calculate NJ tree. If this is the only action
- specified (e.g. $ clustalv proteins.seq/tree ) it IS
- ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED. If
- the sequences are not already aligned, you should
- also give the /ALIGN switch. This will align the
- sequences first, output an alignment file and
- calculate the tree in memory.
-
- /BOOTSTRAP(=n) :bootstrap a NJ tree (n= number of bootstraps;
- default = 1000). If this is the only action
- specified (e.g. $ clustalv proteins.seq/bootstrap )
- it IS ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED.
- If the sequences are not already aligned, you should
- also give the /ALIGN switch. This will align the
- sequences first, output an alignment file and
- calculate the bootstraps in memory. You can set the
- number of bootstrap trials here (e.g. /bootstrap=500).
- You can set the seed number for the random number
- generator with /seed=n.
-
-
-
- PARAMETERS (set things)
-
- ***Pairwise alignments:***
-
- /KTUP=n :word size
-
- /TOPDIAGS=n :number of best diagonals
-
- /WINDOW=n :window around best diagonals
-
- /PAIRGAP=n :gap penalty
-
-
-
- ***Multiple alignments:***
-
- /FIXEDGAP=n :fixed length gap pen.
-
- /FLOATGAP=n :variable length gap pen.
-
- /MATRIX= :PAM100 or ID or file name. The default weight matrix
- for proteins is PAM 250.
-
- /TYPE=p or d :type is protein or DNA. This allows you to
- explicitly override the program's attempt at guessing
- the type of the sequence. It is only useful if you
- are using sequences with a VERY strange composition.
-
- /OUTPUT= :GCG or PHYLIP or PIR. The default output is
- Clustal format.
-
- /TRANSIT :transitions not weighted. The default is to weight
- transitions as more favourable than other mismatches
- in DNA alignments. This switch makes all nucleotide
- mismatches equally weighted.
-
-
- ***Trees:***
-
- /KIMURA :use Kimura's correction on distances.
-
- /TOSSGAPS :ignore positions with a gap in ANY sequence.
-
- /SEED=n :seed number for bootstraps.
-
-
-
-
- EXAMPLES:
-
- These examples use the VAX/VMS $ prompt; otherwise, command-line
- usage is the same on all machines except the Macintosh.
-
-
- $ clustalv proteins.seq OR $ clustalv /infile=proteins.seq
-
- Read whatever sequences are in the file "proteins.seq" and do a full
- multiple alignment; output will go to the files: "proteins.dnd"
- (dendrogram) and "proteins.aln" (alignment).
-
-
- $ clustalv proteins.seq/ktup=2/matrix=pam100/output=pir
-
- Same as last example but use K-Tuple size of 2; use a PAM 100
- protein weight matrix; write the alignment out in NBRF/PIR format
- (goes to a file called "proteins.pir").
-
-
- $ clustalv /profile1=proteins.seq/profile2=more.seq/type=p/fixed=11
-
- Take the alignment in "proteins.seq" and align it with "more.seq"
- using default values for everything except the fixed gap penalty
- which is set to 11. The sequence type is explicitly set to
- PROTEIN.
-
-
- $ clustalv proteins.pir/tree/kimura
-
- Take the sequences in proteins.pir (they MUST BE ALIGNED ALREADY)
- and calculate a phylogenetic tree using Kimura's correction for
- distances.
-
-
- $ clustalv proteins.pir/align/tree/kimura
-
- Same as the previous example, EXCEPT THAT AN ALIGNMENT IS DONE
- FIRST.
-
-
- $ clustalv proteins.seq/align/boot=500/seed=99/tossgaps/type=p
-
- Take the sequences in proteins.seq; they are explicitly set to be
- protein; align them; bootstrap a tree using 500 samples and a seed
- number of 99, ignoring positions with a gap in any sequence.
-
-
- *******************************************************************
-
- 5. Algorithms and references.
-
-
-
- In this section, we will try to BRIEFLY describe the algorithms used
- in ClustalV and give references. The topics covered are:
-
-
- -Multiple alignments
-
- -Profile alignments
-
- -Protein weight matrices
-
- -Phylogenetic trees
-
- -distances
-
- -NJ method
-
- -Bootstrapping
-
- -Phylip
-
- -References
-
-
-
-
-
-
- MULTIPLE ALIGNMENTS.
-
- The approach used in ClustalV is a modified version of the method of
- Feng and Doolittle (1987) who aligned the sequences in larger and
- larger groups according to the branching order in an initial
- phylogenetic tree. This approach allows a very useful combination
- of computational tractability and sensitivity.
-
- The positions of gaps that are generated in early alignments remain
- through later stages. This can be justified because gaps that arise
- from the comparison of closely related sequences should not be moved
- because of later alignment with more distantly related sequences.
- At each alignment stage, you align two groups of already aligned
- sequences. This is done using a dynamic programming algorithm where
- one allows the residues that occur in every sequence at each
- alignment position to contribute to the alignment score. A Dayhoff
- (1978) PAM matrix is used in protein comparisons.
-
- The details of the algorithm used in ClustalV have been published in
- Higgins and Sharp (1989). This was an improved version of an
- earlier algorithm published in Higgins and Sharp (1988). First, you
- calculate a crude similarity measure between every pair of sequences.
- This is done using the fast, approximate alignment algorithm of
- Wilbur and Lipman (1983). Then, these scores are used to calculate
- a "guide tree" or dendrogram, which will tell the multiple alignment
- stage in which order to align the sequences for the final multiple
- alignment. This "guide tree" is calculated using the UPGMA method
- of Sneath and Sokal (1973). UPGMA is a fancy name for one type of
- average linkage cluster analysis, invented by Sokal and Michener
- (1958).
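
To make the idea of average linkage clustering concrete, here is a minimal UPGMA sketch in Python. It is an illustration only and not a transcription of the ClustalV source: it works on distances (smaller means more similar), whereas the program starts from similarity scores, and the function name upgma, the data layout and the toy distances are all assumptions made for the example.

```python
def upgma(names, dist):
    """Cluster by UPGMA.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) -> distance for every pair.
    Returns the merges in order as (cluster_a, cluster_b, height),
    where a merged cluster is written as the tuple (cluster_a, cluster_b).
    """
    sizes = {name: 1 for name in names}   # cluster -> number of leaves
    d = dict(dist)                        # work on a copy
    merges = []
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        pair = min(d, key=lambda p: d[p])
        a, b = sorted(pair, key=str)
        new = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # Distance from the new cluster to every other cluster is the
        # average over all leaf pairs (size-weighted mean of the old rows).
        for c in sizes:
            dac = d.pop(frozenset((a, c)))
            dbc = d.pop(frozenset((b, c)))
            d[frozenset((new, c))] = (na * dac + nb * dbc) / (na + nb)
        merges.append((a, b, d.pop(pair) / 2))
        sizes[new] = na + nb
    return merges

# Toy data: A,B are close, C,D are fairly close, the two pairs are far apart.
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 2.0, frozenset(("C", "D")): 4.0,
    frozenset(("A", "C")): 10.0, frozenset(("A", "D")): 10.0,
    frozenset(("B", "C")): 10.0, frozenset(("B", "D")): 10.0,
}
merges = upgma(names, dist)
for a, b, height in merges:
    print(a, "joins", b, "at height", height)
```

The order of the merges is what matters for the multiple alignment: it tells you which sequences or groups to align first.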
-
- Having calculated the dendrogram, the sequences are aligned in
- larger and larger groups. At each alignment stage, we use the
- algorithm of Myers and Miller (1988) for the optimal alignments.
- This algorithm is a very memory efficient variation of Gotoh's
- algorithm (Gotoh, 1982). It is because of this algorithm that
- ClustalV can work on microcomputers. Each of these alignments
- consists of aligning 2 alignments, using what we call "profile
- alignments".
-
-
- PROFILE ALIGNMENTS.
-
- We use the term "profile alignment" to describe the alignment of 2
- alignments. We use this term because the method is a simple
- extension of the profile method of Gribskov et al. (1987) for
- aligning 1 sequence with an alignment. Normally, with a 2 sequence
- alignment, you use a weight matrix (e.g. a PAM 250 matrix) to give a
- score between the pairs of aligned residues. The alignment is
- considered "optimal" if it gives the best total score for aligned
- residues minus penalties for any gaps (insertions or deletions) that
- must be introduced.
-
- Profile alignments are a simple extension of 2 sequence alignments
- in that you can treat each of the two input alignments as single
- sequences but you calculate the score at aligned positions as the
- average weight matrix score of all the residues in one alignment
- versus all those in the other e.g. if you have 2 alignments with I
- and J sequences respectively, the score at any position is the
- average of all the I times J scores of the residues compared
- separately. Any gaps that are introduced are placed in all of the
- sequences of an alignment at the same position. The profile
- alignments offered in the "profile alignment menu" are also
- calculated in this way.
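
The averaging rule can be written down directly. The Python sketch below is an illustration under our own assumptions: column_score and identity_score are hypothetical names, the toy scoring function simply mimics an identity matrix (10 for a match, 0 otherwise), and the columns are assumed to contain residues only, since how gap characters inside a column are scored is not described here.

```python
def column_score(col_a, col_b, score):
    """Average weight matrix score between two alignment columns.

    col_a holds the I residues at one position of the first alignment,
    col_b the J residues at one position of the second; every residue
    in col_a is compared with every residue in col_b (I times J scores).
    """
    total = sum(score(x, y) for x in col_a for y in col_b)
    return total / (len(col_a) * len(col_b))

def identity_score(x, y):
    """Toy weight matrix: 10 for identical residues, 0 otherwise."""
    return 10 if x == y else 0

# One column from an alignment of 3 sequences against one column
# from an alignment of 2 sequences.
print(column_score("AAG", "AG", identity_score))
```

In this example 3 of the 6 residue pairs are identical, so the column score is 30/6 = 5.0.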
-
-
- PROTEIN WEIGHT MATRICES.
-
- There are 3 built-in weight matrices used by ClustalV. These are
- the PAM 100 and PAM 250 matrices of Dayhoff (1978) and an identity
- matrix. Each matrix is given as the bottom left half, including the
- diagonal, of a 20 by 20 matrix. The order of the rows and columns is
- CSTPAGNDEQHRKMILVFYW.
-
-
- PAM 250
-
- C 12
- S 0 2
- T -2 1 3
- P -3 1 0 6
- A -2 1 1 1 2
- G -3 1 0 -1 1 5
- N -4 1 0 -1 0 0 2
- D -5 0 0 -1 0 1 2 4
- E -5 0 0 -1 0 0 1 3 4
- Q -5 -1 -1 0 0 -1 1 2 2 4
- H -3 -1 -1 0 -1 -2 2 1 1 3 6
- R -4 0 -1 0 -2 -3 0 -1 -1 1 2 6
- K -5 0 0 -1 -1 -2 1 0 0 1 0 3 5
- M -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6
- I -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5
- L -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6
- V -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4
- F -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9
- Y 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10
- W -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17
- ----------------------------------------------------------------
- C S T P A G N D E Q H R K M I L V F Y W
-
-
- IDENTITY MATRIX
-
- 10
- 0 10
- 0 0 10
- 0 0 0 10
- 0 0 0 0 10
- 0 0 0 0 0 10
- 0 0 0 0 0 0 10
- 0 0 0 0 0 0 0 10
- 0 0 0 0 0 0 0 0 10
- 0 0 0 0 0 0 0 0 0 10
- 0 0 0 0 0 0 0 0 0 0 10
- 0 0 0 0 0 0 0 0 0 0 0 10
- 0 0 0 0 0 0 0 0 0 0 0 0 10
- 0 0 0 0 0 0 0 0 0 0 0 0 0 10
- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
-
-
-
-
-
- PAM 100
-
- 14
- -1 6
- -5 2 7
- -6 1 -1 10
- -5 2 2 1 6
- -8 1 -3 -3 1 8
- -8 2 0 -3 -1 -1 7
- -11 -1 -2 -4 -1 -1 4 8
- -11 -2 -3 -3 0 -2 1 5 8
- -11 -3 -3 -1 -2 -5 -1 1 4 9
- -6 -4 -5 -2 -5 -7 2 -1 -2 4 11
- -6 -1 -4 -2 -5 -8 -3 -6 -5 1 1 10
- -11 -2 -1 -4 -4 -5 1 -2 -2 -1 -3 3 8
- -11 -4 -2 -6 -3 -8 -5 -8 -6 -2 -7 -2 1 13
- -5 -4 -1 -6 -3 -7 -4 -6 -5 -5 -7 -4 4 2 9
- -12 -7 -5 -5 -5 -8 -6 -9 -7 -3 -5 -7 -6 4 2 9
- -4 -4 -1 -4 0 -4 -5 -6 -5 -5 -6 -6 -6 1 5 1 8
- -10 -5 -6 -9 -7 -8 -6 -11 -11 -10 -4 -7 -11 -2 0 0 -5 12
- -2 -6 -6 -11 -6 -11 -3 -9 -7 -9 -1 -10 -10 -8 -4 -5 -6 6 13
- -13 -4 -10 -11 -11 -13 -8 -13 -14 -11 -7 1 -9 -11 -12 -7 -14 -2 -2 19
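
Because the matrices above are listed as lower triangles, a program that wants to look up the score for any pair of residues has to mirror them first. The Python sketch below shows one way to do this. It is our own illustration (expand_lower_triangle is a hypothetical name) and, to keep it short, only the first four rows of the PAM 250 listing (C, S, T, P) are typed in.

```python
ORDER = "CSTPAGNDEQHRKMILVFYW"   # row and column order given above

def expand_lower_triangle(rows, order):
    """Turn lower-triangle rows into a symmetric lookup table.

    rows[i] holds the scores of residue order[i] against
    order[0] .. order[i], the last entry being the diagonal.
    """
    lookup = {}
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            lookup[(order[i], order[j])] = value
            lookup[(order[j], order[i])] = value
    return lookup

# First four rows of the PAM 250 listing above (C, S, T, P).
pam250_first_rows = [
    [12],
    [0, 2],
    [-2, 1, 3],
    [-3, 1, 0, 6],
]
pam250 = expand_lower_triangle(pam250_first_rows, ORDER)
print(pam250[("C", "P")], pam250[("P", "C")], pam250[("T", "T")])
```

After expansion the score for a pair can be looked up in either order, e.g. C with P and P with C give the same value.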
-
-
-
-
- PHYLOGENETIC TREES.
-
- There are two COMMONLY used approaches for inferring phylogenetic
- trees from sequence data: parsimony and distance methods. There are
- other approaches which are probably superior in theory but which are
- yet to be used widely. This does not mean that they are no use; we
- (the authors of this program at any rate) simply do not know enough
- about them yet. You should see the documentation accompanying the
- Phylip package and some of the references there for an explanation
- of the different methods and what assumptions are implied when you
- use them.
-
- There is a constant debate in the literature as to the merits of
- different methods but unfortunately, a lot of what is said is
- incomprehensible or inaccurate. It is also a field that is prone to
- having highly opinionated schools of thought. This is a pity
- because it prevents rational discussion of the pros and cons of
- the different methods. The approach adopted in ClustalV is to
- supply just one method and to produce alignments in a format that
- can be used by Phylip. In simple cases, the trees produced will be
- as "good" (reliable, robust) as those from ANY other method. In
- more complicated cases, there is no single magic recipe that we can
- supply that will work well in even most situations.
-
- The method we provide is the Neighbor Joining method (NJ) of Saitou
- and Nei (1987) which is a distance method. We use this for three
- reasons: it is conceptually and computationally simple; it is fast;
- it gives "good" trees in simple cases. It is difficult to prove that
- one tree is "better" than another if you do not know the true
- phylogeny; the few systematic surveys of methods show it to work
- more or less as well as any other method ON AVERAGE. Another reason
- for using the NJ method is that it is very commonly used; THIS IS A
- BAD REASON SCIENTIFICALLY but at least you will not feel lonely if
- you use it.
-
- The NJ method works on a matrix of distances (the distance matrix)
- between all pairs of sequences to be analysed. These distances are
- related to the degree of divergence between the sequences. It is
- normal to calculate the distances from the sequences after they are
- multiply aligned. If you calculate them from separate alignments
- (as done for the dendrograms in another part of this program), you
- may increase the error considerably.
-
-
- DISTANCES
-
- The simplest measure of distance between sequences is percent
- divergence (100% minus percent identity). For two sequences, you
- count how many positions differ between them (ignoring all positions
- with a gap or an unknown residue) and divide by the number of
- positions considered. It is common practice to also ignore all
- positions in the alignment where there is a GAP in ANY of the
- sequences (the "Exclude positions with gaps?" menu option, or
- /TOSSGAPS on the command line). Usually, you express the percent
- distance divided by 100 (gives distances between 0.0 and 1.0).
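
The calculation just described can be sketched in Python as follows. This is an illustration under our own assumptions rather than the routine used in ClustalV: the names divergence and gap_columns are invented, gaps are written as "-", and any characters to be treated as unknown residues must be passed in explicitly through the ignore argument.

```python
def divergence(seq_a, seq_b, skip_columns=frozenset(), ignore="-"):
    """Return (distance, number of positions compared) for two
    aligned sequences of equal length.

    Positions listed in skip_columns, and positions where either
    sequence has a character from `ignore`, are left out.
    """
    compared = differing = 0
    for i, (a, b) in enumerate(zip(seq_a, seq_b)):
        if i in skip_columns or a in ignore or b in ignore:
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no positions left to compare")
    return differing / compared, compared

def gap_columns(alignment):
    """Columns with a gap in ANY sequence (used to toss gaps)."""
    return frozenset(
        i for i in range(len(alignment[0]))
        if any(seq[i] == "-" for seq in alignment)
    )

# Toy alignment (hypothetical data).
alignment = ["ACGTAC-G", "ACGTTACG", "AC-TAACC"]

print(divergence(alignment[0], alignment[1]))
print(divergence(alignment[0], alignment[1], gap_columns(alignment)))
```

In the first call only the position with a gap in one of the two sequences is skipped (7 positions compared); in the second, every column with a gap anywhere in the alignment is skipped (6 positions compared), so all pairwise distances are based on the same positions.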
-
- This measure of distance is perfectly adequate (with some further
- modification described below) for rRNA sequences. However it treats
- all residues identically e.g. all amino acid substitutions are
- equally weighted. It also treats all positions identically e.g. it
- does not take account of different rates of substitution in
- different positions of different codons in protein coding DNA
- sequences; see Li et al. (1985) for a distance measure that does.
- Despite these shortcomings, these percent identity distances do work
- well in practice in a wide variety of situations.
-
- In a simple world, you would like a distance to be proportional to
- the time since the sequences diverged. If this were EXACTLY true,
- then the calculation of the tree would be a simple matter of algebra
- (UPGMA does this for you) and the branch lengths will be nice and
- meaningful (times). In practice this OBVIOUSLY depends on the
- existence and quality of the "molecular clock", a subject of
- ongoing debate. However, even if there is a good clock, there is a
- further problem with estimating divergences. As sequences diverge,
- they become "saturated" with mutations. Sites can have
- substitutions more than once. Calculated distances will
- underestimate actual divergence times; the greater the divergence,
- the greater the discrepancy. There are various methods for dealing
- with this and we provide two commonly used ones, both due to Motoo
- Kimura; one for proteins and one for DNA.
-
-
- For distance K (percent divergence /100 ) ...
-
- Correction for Protein distances: (Kimura, 1983).
-
- Corrected K = -ln(1.0 - K - (K * K/5.0))
-
-
-
- Correction for nucleotide distances: Kimura's 2-parameter method
- (Kimura, 1980).
-
- Corrected K = 0.5*ln(a) + 0.25*ln(b)
-
- where a = 1/(1 - 2*P - Q)
- and b = 1/(1 - 2*Q)
-
- P and Q are the proportions of transitions (A<-->G, C<-->T)
- and transversions occurring between the sequences.
-
-
- One paradoxical effect of these corrections is that distances can
- be corrected to have more than 100% divergence. That is because,
- for very highly diverged sequences of length N, you can estimate
- that more than N substitutions have occurred by correcting the
- observed distance in the above ways. Don't panic!
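
The two corrections are simple enough to try out by hand. The Python sketch below is an illustration only; the function names are our own, and the protein formula is read as -ln(1 - K - K*K/5), i.e. with the observed distance K squared in the last term.

```python
import math

def kimura_protein(k):
    """Kimura (1983) correction for an observed protein distance k
    (percent divergence / 100)."""
    return -math.log(1.0 - k - (k * k) / 5.0)

def kimura_2p(p, q):
    """Kimura (1980) 2-parameter correction for nucleotide distances.

    p = proportion of transitions, q = proportion of transversions.
    """
    a = 1.0 / (1.0 - 2.0 * p - q)
    b = 1.0 / (1.0 - 2.0 * q)
    return 0.5 * math.log(a) + 0.25 * math.log(b)

# A small and a large observed distance from the globin example above.
for k in (0.0897, 0.7517):
    print(k, kimura_protein(k))

# 10% transitions and 5% transversions (hypothetical figures).
print(kimura_2p(0.10, 0.05))
```

Small distances are changed very little, while a large observed distance such as 0.75 is corrected to well over 1.0, which is the "more than 100%" effect mentioned above. For extremely diverged sequences the expression inside the logarithm becomes zero or negative and the correction cannot be calculated at all; in Python, math.log then raises a ValueError.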
-
-
-
- NEIGHBOR JOINING TREES.
-
- VERY briefly, the NJ method works as follows. You start by placing
- the sequences in a star topology (no internal branches). You then
- find that internal branch (take 2 sequences; join them; connect them
- to the rest by the internal branch) which when added to the tree
- will minimise the total branch length. The two joined sequences
- (neighbours) are merged into a single sequence and the process is
- repeated. For an unrooted tree with N sequences, there are N-3
- internal branches. The above process is repeated N-3 times to give
- the final tree. The full details are given in Saitou and Nei
- (1987).
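
For readers who like to see the steps spelled out, here is a compact Python sketch of neighbour joining. It is NOT a transcription of the ClustalV code: it uses the widely used "Q criterion" way of choosing the pair to join, the names and data layout are our own, and the small distance matrix is a made-up example in which the distances fit a tree exactly.

```python
def neighbor_joining(names, dist):
    """Neighbour joining on a symmetric distance table.

    dist[a][b] is the distance between a and b (both directions
    must be present). Returns (joins, final): joins is a list of
    (a, length_a, b, length_b) for each cycle, and final gives the
    branch lengths of the last three nodes (the trichotomy).
    """
    d = {a: dict(row) for a, row in dist.items()}   # work on a copy
    nodes = list(names)
    joins = []
    while len(nodes) > 3:
        n = len(nodes)
        # Sum of distances from each node to all the others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Join the pair with the smallest Q = (n-2)*d(a,b) - r(a) - r(b).
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        # Replace a and b by a single new node.
        new = (a, b)
        d[new] = {}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
        joins.append((a, len_a, b, len_b))
    # Last cycle: the three remaining nodes meet at one point.
    x, y, z = nodes
    final = {
        x: 0.5 * (d[x][y] + d[x][z] - d[y][z]),
        y: 0.5 * (d[x][y] + d[y][z] - d[x][z]),
        z: 0.5 * (d[x][z] + d[y][z] - d[x][y]),
    }
    return joins, final

# Made-up distances for 5 sequences that fit a tree exactly.
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
names = ["a", "b", "c", "d", "e"]
dist = {x: {} for x in names}
for (x, y), value in pairs.items():
    dist[x][y] = dist[y][x] = value

joins, final = neighbor_joining(names, dist)
for a, len_a, b, len_b in joins:
    print(a, len_a, "joins", b, len_b)
print("trichotomy:", final)
```

With 5 sequences there are 2 joining cycles followed by the final trichotomy, matching the N-3 internal branches mentioned above.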
-
- As explained elsewhere in the documentation, you can only root the
- tree by one of two methods:
-
- 1) assume a degree of constancy in the molecular clock and place the
- root along the longest branch (internal or external). Methods that
- appear to produce rooted trees automatically are often just doing
- this without letting you know; this is true of UPGMA.
-
- 2) root the tree on biological grounds. The usual method is to
- include an "outgroup", a sequence that you are certain will branch
- to the outside of the tree.
-
-
-
- BOOTSTRAPPING.
-
- Bootstrapping is a general purpose technique that can be used for
- placing confidence limits on statistics that you estimate without
- any knowledge of the underlying distribution (e.g. a normal or
- Poisson distribution). In the case of phylogenetic trees, there are
- several analytical methods for placing confidence limits on
- groupings (actually on the internal branches) but these are either
- restricted to particular tree drawing methods or only work on small
- trees of 4 or 5 sequences. Felsenstein (1985) showed how to use
- bootstrapping to calculate confidence limits on trees. His approach
- is completely general and can be applied to any tree drawing method.
- The main assumption of the method in this context is that the sites
- in the alignment are independent; this will be true of some sequence
- alignments (e.g. pseudogenes) but not others (e.g. rRNAs). What
- effect a lack of independence will have on the results is not known.
-
- The method works by taking random samples of data from the complete
- data set. You compute the test statistic (tree in this case) on
- each sample. The spread of the statistic across the samples gives
- a measure of its sampling variation, which can be used to calculate
- confidence intervals. Each random sample is the same size as the
- complete data set and is taken WITH REPLACEMENT, i.e. a data
- point can be selected more than once (or not at all) in any given
- sample.
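-
- The general idea can be sketched as follows for an ordinary
- statistic such as a mean. The percentile interval used here is just
- one simple way of turning the sample statistics into confidence
- limits; it is an assumption of this example, as are the function
- name, the default of 1000 samples and the made-up data values.
-
```python
import random
import statistics


def bootstrap_interval(data, statistic, n_samples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for statistic(data)."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_samples):
        # Same size as the data set, drawn WITH replacement: a point may
        # appear several times or not at all in a given sample.
        sample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(sample))
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * n_samples)]
    upper = estimates[min(n_samples - 1, int((1 + level) / 2 * n_samples))]
    return lower, upper


measurements = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]
low, high = bootstrap_interval(measurements, statistics.mean)
print(f"mean = {statistics.mean(measurements):.2f}, "
      f"95% interval = ({low:.2f}, {high:.2f})")
```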
-
- In the case of an alignment N residues long, each random sample is a
- random selection of N sites from the alignment. For each sample, we
- calculate a distance matrix and tree in the usual way. Variation in
- the sample trees compared to a tree calculated from the full data
- set gives an indication of how well supported the tree is by the
- data. If the sample trees are very similar to each other and to the
- full tree, then the tree is "strongly" supported; if the sample
- trees show great variation, then the tree will be weakly supported.
- In practice, you usually find some parts of a tree well supported,
- others weakly. This can be seen by counting how often each
- monophyletic group in the full tree occurs in the sample trees.
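-
- Drawing one bootstrap sample from an alignment amounts to picking N
- column positions at random, with replacement, and using the same
- positions for every sequence. A minimal sketch, assuming the
- alignment is held as a list of equal-length strings (the function
- name and the toy alignment are made up for the example):
-
```python
import random


def resample_columns(alignment, rng):
    """Return one bootstrap sample of an alignment (list of strings)."""
    n_sites = len(alignment[0])
    # N sites chosen with replacement; some columns repeat, some vanish.
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same columns are taken from every sequence, so each sampled
    # site is still a complete column of the original alignment.
    return ["".join(seq[i] for i in picks) for seq in alignment]


toy_alignment = ["ACGTAC", "AGGTAC", "TCGAAC"]
sample = resample_columns(toy_alignment, random.Random(1))
for row in sample:
    print(row)
```
-
- A distance matrix and tree would then be computed from each such
- sample in exactly the same way as for the full alignment.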
-
- For a particular grouping, one considers it to be significant at the
- 95% level (P <= 0.05) if it occurs in at least 95% of the bootstrap
- samples. If a grouping is significant, it is significant with
- respect to the particular data set and method used for drawing the
- tree.
- Biological "significance" is another matter.
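-
- Counting how often each grouping turns up can be sketched as below.
- The example assumes that each tree has already been reduced to a
- set of groupings, with every grouping stored as a frozenset of
- sequence names; this representation, the function names and the
- counts in the toy data are all made up for illustration.
-
```python
def group_support(full_tree_groups, sample_tree_groups):
    """Fraction of sample trees containing each group of the full tree."""
    n = len(sample_tree_groups)
    return {
        group: sum(group in groups for groups in sample_tree_groups) / n
        for group in full_tree_groups
    }


def significant_groups(support, threshold=0.95):
    """Groups found in at least `threshold` of the bootstrap samples."""
    return {group for group, frac in support.items() if frac >= threshold}


ab = frozenset({"a", "b"})
cd = frozenset({"c", "d"})
# 100 pretend sample trees: {a,b} occurs in 97 of them, {c,d} in 60.
samples = [{ab, cd}] * 60 + [{ab}] * 37 + [set()] * 3
support = group_support([ab, cd], samples)
for group, frac in support.items():
    print(sorted(group), f"{frac:.0%}")
```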
-
-
- PHYLIP.
-
- The Phylip package was written by Joe Felsenstein, University of
- Washington, USA. It provides Pascal source code for a large number
- of programs for doing most types of phylogenetic analyses. The
- Phylip format alignments produced by this program can be used by all
- of the Phylip programs, version 3.4 or later (March 1991). The
- package is freely available from him, as described in the
- information sheet below.
-
-
-
- ================= PHYLIP information sheet =====================
-
- PHYLIP - Phylogeny Inference Package (version 3.3)
-
- This is a FREE package of programs for inferring phylogenies and
- carrying out certain related tasks. At present it contains 28
- programs, which carry out different algorithms on different kinds of
- data. The programs in the package are:
-
- ---------- Programs for molecular sequence data ----------
- PROTPARS Protein parsimony
- DNAPARS Parsimony method for DNA
- DNAMOVE Interactive DNA parsimony
- DNAPENNY Branch and bound for DNA
- DNABOOT Bootstraps DNA parsimony
- DNACOMP Compatibility for DNA
- DNAINVAR Phylogenetic invariants
- DNAML Maximum likelihood method
- DNAMLK DNAML with molecular clock
- DNADIST Distances from sequences
- RESTML ML for restriction sites
-
- ----------- Programs for distance matrix data ------------
- FITCH Fitch-Margoliash and least-squares methods
- KITSCH Fitch-Margoliash and least squares methods with
- evolutionary clock
-
- --- Programs for gene frequencies and continuous characters --
- CONTML Maximum likelihood method
- GENDIST Computes genetic distances
-
- ------------- Programs for discrete state data -----------
- MIX Wagner, Camin-Sokal, and mixed parsimony criteria
- MOVE Interactive Wagner, C-S, mixed parsimony program
- PENNY Finds all most parsimonious trees by branch-and-bound
- BOOT Bootstrap confidence interval on mixed parsimony methods
- DOLLOP, DOLMOVE, DOLPENNY, DOLBOOT same as preceding four
- programs, but for the Dollo and polymorphism parsimony
- criteria
- CLIQUE Compatibility method
- FACTOR recode multistate characters
-
- ---- Programs for plotting trees and consensus trees ----
- DRAWGRAM Draws cladograms and phenograms on screens, plotters and
- printers
- DRAWTREE Draws unrooted phylogenies on screens, plotters and
- printers
- CONSENSE Majority-rule and strict consensus trees
-
- The package includes extensive documentation files that provide the
- information necessary to use and modify the programs.
-
- COMPATIBILITY: The programs are written in a very standard subset of
- Pascal, a language that is available on most computers (including
- microcomputers). The programs require only trivial modifications to
- run on most machines: for example they work with only minor
- modifications with Turbo Pascal, and without modifications on VAX
- VMS Pascal. Pascal source code is distributed in the regular version
- of PHYLIP: compiled object code is not. To use that version, you
- must have a Pascal compiler.
-
- DISKETTE DISTRIBUTION: The package is distributed in a variety of
- microcomputer diskette formats. You should send FORMATTED
- diskettes, which I will return with the package written on them.
- Unfortunately, I cannot write any Apple formats. See below for how
- many diskettes to send. The programs on the magnetic tape or
- electronic network versions may of course also be moved to
- microcomputers using a terminal program.
-
- PRECOMPILED VERSIONS: Precompiled executable programs for PCDOS
- systems are available from me. Specify the "PCDOS executable
- version" and send the number of extra diskettes indicated below.
- An Apple Macintosh version with precompiled code is available from
- Willem Ellis, Instituut voor Taxonomische Zoologie, Zoologisch
- Museum, Universiteit van Amsterdam, Plantage Middenlaan 64, 1018DH
- Amsterdam, Netherlands, who asks that you send 5 800K diskettes.
-
- HOW MANY DISKETTES TO SEND: The following table shows for different
- PCDOS formats how many diskettes to send, and how many extra
- diskettes to send for the PCDOS executable version:
-
-   Diskette size   Density   For source code   For executables,
-                                                send in addition
-   3.5 inch        1.44 Mb   2                 1
-   5.25 inch       1.2 Mb    2                 2
-   3.5 inch        720 Kb    4                 2
-   5.25 inch       360 Kb    7                 4
-
- Some other formats are also available. You MUST tell me EXACTLY
- which of these formats you need. The diskettes MUST be formatted by
- you before being sent to me. Sending an extra diskette may be
- helpful.
-
- NETWORK DISTRIBUTION: The package is also available by distribution
- of the files directly over electronic networks, and by anonymous ftp
- from evolution.genetics.washington.edu. Contact me by electronic
- mail for details.
-
- TAPE DISTRIBUTION: The programs are also distributed on a magnetic
- tape provided by you (which should be a small tape and need only be
- able to hold two megabytes) in the following format: 9-track, ASCII,
- odd parity, unlabelled, 6250 bpi (unless otherwise indicated).
- Logical record: 80 bytes, physical record: 3200 bytes (i.e. blocking
- factor 40). There are a total of 71 files. The first one describes
- the contents of the package.
-
- POLICIES: The package is distributed free. I do not make it
- available or support it in South Africa. The package will be
- written on the diskettes or tape, which will be mailed back. They
- can be sent to:
-
- Joe Felsenstein
- Department of Genetics SK-50
- University of Washington
- Seattle, Washington 98195
- U.S.A.
-
- Electronic mail addresses:
- Internet: joe@genetics.washington.edu
- Bitnet/EARN: felsenst@uwavm
- UUCP: uw-beaver!evolution.genetics!joe
-
-
- ===================== End of Phylip Info. Sheet ====================
-
-
-
-
- REFERENCES.
-
- Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978) in Atlas of
- Protein Sequence and Structure, Vol. 5 supplement 3, Dayhoff, M.O.
- (ed.), NBRF, Washington, p. 345.
-
- Felsenstein, J. (1985) Confidence limits on phylogenies: an
- approach using the bootstrap. Evolution 39, 783-791.
-
- Feng, D.-F. and Doolittle, R.F. (1987) Progressive sequence
- alignment as a prerequisite to correct phylogenetic trees.
- J.Mol.Evol. 25, 351-360.
-
- Gotoh, O. (1982) An improved algorithm for matching biological
- sequences. J.Mol.Biol. 162, 705-708.
-
- Gribskov, M., McLachlan, A.D. and Eisenberg, D. (1987) Profile
- analysis: detection of distantly related proteins. PNAS USA 84,
- 4355-4358.
-
- Higgins, D.G. and Sharp, P.M. (1988) CLUSTAL: a package for
- performing multiple sequence alignments on a microcomputer. Gene
- 73, 237-244.
-
- Higgins, D.G. and Sharp, P.M. (1989) Fast and sensitive multiple
- sequence alignments on a microcomputer. CABIOS 5, 151-153.
-
- Kimura, M. (1980) A simple method for estimating evolutionary
- rates of base substitutions through comparative studies of
- nucleotide sequences. J.Mol.Evol. 16, 111-120.
-
- Kimura, M. (1983) The Neutral Theory of Molecular Evolution.
- Cambridge University Press, Cambridge, England.
-
- Li, W.-H., Wu, C.-I. and Luo, C.-C. (1985) A new method for
- estimating synonymous and nonsynonymous rates of nucleotide
- substitution considering the relative likelihood of nucleotide and
- codon changes. Mol.Biol.Evol. 2, 150-174.
-
- Myers, E.W. and Miller, W. (1988) Optimal alignments in linear
- space. CABIOS 4, 11-17.
-
- Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological
- sequence comparison. PNAS USA 85, 2444-2448.
-
- Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new
- method for reconstructing phylogenetic trees. Mol.Biol.Evol. 4,
- 406-425.
-
- Sneath, P.H.A. and Sokal, R.R. (1973) Numerical Taxonomy. Freeman,
- San Francisco.
-
- Sokal, R.R. and Michener, C.D. (1958) A statistical method for
- evaluating systematic relationships. Univ.Kansas Sci.Bull. 38,
- 1409-1438.
-
- Vingron, M. and Argos, P. (1991) Motif recognition and alignment
- for many sequences by comparison of dot matrices. J.Mol.Biol. 218,
- 33-43.
-
- Wilbur, W.J. and Lipman, D.J. (1983) Rapid similarity searches of
- nucleic acid and protein data banks. PNAS USA 80, 726-730.
-
-